Skip to content

node condtion remediation#126

Open
runzhen wants to merge 6 commits intomainfrom
runzhen/npd2
Open

node condtion remediation#126
runzhen wants to merge 6 commits intomainfrom
runzhen/npd2

Conversation

@runzhen
Copy link
Collaborator

@runzhen runzhen commented Mar 12, 2026

No description provided.

Copilot AI review requested due to automatic review settings March 12, 2026 21:02
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a periodic node-condition monitoring/remediation loop to the agent daemon, intended to detect the KernelDeadlock node condition and reboot the node when it occurs.

Changes:

  • Start a new daemon loop that queries the Kubernetes Node object on a 1-minute cadence and triggers a reboot when KernelDeadlock=True after host boot time.
  • Add helper functions to determine boot time, node name, and execute a reboot.
  • Update Go module dependencies (go.mod/go.sum) to include additional indirect dependencies pulled in by the new Kubernetes client usage.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.

File Description
commands.go Adds node condition monitoring/remediation loop and reboot helpers.
go.mod Adds new indirect dependencies required by the new code paths.
go.sum Records checksums for newly introduced/updated indirect dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +516 to +517
wg.Add(1)

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startDaemonLoops pre-increments the WaitGroup counter for the loops it starts, but startNodeConditionLoop also calls wg.Add(1) internally. This hidden increment is inconsistent with the other loops and makes it easier to accidentally introduce a WaitGroup misuse/panic later; consider moving the Add(1) into startDaemonLoops and removing it from here.

Suggested change
wg.Add(1)

Copilot uses AI. Check for mistakes.
Comment on lines +530 to +542
// Load kubeconfig
config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubelet/kubeconfig")
if err != nil {
logger.Errorf("failed to load kubeconfig: %s", err.Error())
return
}

// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
logger.Errorf("failed to create clientset: %s", err.Error())
return
}
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clientcmd.BuildConfigFromFlags and kubernetes.NewForConfig are executed on every tick. Since these typically only depend on local files and can be reused, consider creating the REST config/clientset once outside the ticker loop and reusing them (recreating only on failure) to reduce per-minute I/O and allocations.

Copilot uses AI. Check for mistakes.
Comment on lines +551 to +562
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
if err != nil {
logger.Errorf("failed to get node %s: %s", nodeName, err.Error())
}

hostBootTime, err := getBootTime()
if err != nil {
logger.Errorf("failed to get host boot time: %s", err.Error())
return
}

for _, condition := range node.Status.Conditions {
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Nodes().Get call returns an error, node can be nil; the code then ranges over node.Status.Conditions, which can panic. Handle the error by returning/continuing before dereferencing node (or guard against nil).

Copilot uses AI. Check for mistakes.
commands.go Outdated
Comment on lines +531 to +542
config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubelet/kubeconfig")
if err != nil {
logger.Errorf("failed to load kubeconfig: %s", err.Error())
return
}

// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
logger.Errorf("failed to create clientset: %s", err.Error())
return
}
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These error paths return from the goroutine, which permanently stops node-condition monitoring after a transient failure (e.g., kubeconfig temporarily unavailable). Prefer logging and continue to the next tick so the daemon can self-recover like the other loops in this file.

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 12, 2026 21:17
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +530 to +532
// Load kubeconfig
kubeConfig, err := clientcmd.BuildConfigFromFlags("", config.KubeletKubeconfigPath)
if err != nil {
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config, err := clientcmd.BuildConfigFromFlags(..., "/var/lib/kubelet/kubelet/kubeconfig") hardcodes a path that already exists as config.KubeletKubeconfigPath and also introduces a local variable named config that shadows the imported pkg/config identifier. Use the shared constant and rename the local to something like kubeConfig to avoid confusion.

Copilot uses AI. Check for mistakes.
Comment on lines +508 to +512
func rebootNode() error {
rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

return rebootCmd.Run()
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct exec.Command usage here will be flagged by the repo’s enabled gosec linter (this codebase typically suppresses with #nosec or routes through pkg/utils helpers). Consider using utils.RunSystemCommand / RunCommandWithOutput (to preserve stderr for debugging) or add an explicit suppression comment with justification.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI commented Mar 12, 2026

@runzhen I've opened a new pull request, #127, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 12, 2026 21:23
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@runzhen runzhen deployed to e2e-testing March 12, 2026 21:25 — with GitHub Actions Active
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +534 to +536
return
}

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this loop, errors when loading kubeconfig cause a return, which stops the goroutine permanently and prevents any future node-condition checks. Consider logging the error and continue to the next tick (or implement a backoff) so transient failures don’t disable remediation for the lifetime of the agent process.

Suggested change
return
}

Copilot uses AI. Check for mistakes.
Comment on lines +553 to +554
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer cancel() is inside a long-running for loop, so the cancels will be deferred until the goroutine exits (potentially never), leaking per-iteration resources. Scope the timeout context to a small inner function/block and call cancel() at the end of each iteration instead of deferring in the outer loop.

Suggested change
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
// Get the node with a per-call context to avoid leaking resources across iterations
ctx, cancel := context.WithCancel(context.Background())
node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
cancel()

Copilot uses AI. Check for mistakes.
Comment on lines +553 to +556
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
if err != nil {
logger.Errorf("failed to get node %s: %s", nodeName, err.Error())
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Node GET fails, the code logs the error but then continues and dereferences node.Status.Conditions, which will panic when node is nil. This should return/continue on error (and avoid attempting remediation) so the agent doesn’t crash the loop on API failures.

Copilot uses AI. Check for mistakes.
Comment on lines +508 to +513
func rebootNode() error {
rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

return rebootCmd.Run()
}
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces an unconditional host reboot action (via sysrq-trigger) in the main daemon loop. Consider gating this behavior behind an explicit config/feature flag (similar to EnableDriftDetectionAndRemediation) and adding rate limiting/guardrails to reduce the risk of reboot loops or unexpected reboots in environments that don’t want automated power actions.

Copilot uses AI. Check for mistakes.
Comment on lines +509 to +512
rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

return rebootCmd.Run()
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct use of exec.Command here may trigger gosec (G204) and also provides no timeout or captured stderr/stdout for troubleshooting. Prefer using the repo’s command execution helper (e.g., pkg/utils.RunSystemCommand / RunCommandWithOutput) or add a scoped #nosec with justification and use exec.CommandContext with a timeout so failures are observable and the call can’t hang indefinitely.

Suggested change
rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")
return rebootCmd.Run()
// Use a bounded context so the reboot command can't hang indefinitely.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// #nosec G204 -- command and arguments are constant literals; no user input is interpolated.
rebootCmd := exec.CommandContext(ctx, "/usr/bin/nsenter", "-m/proc/1/ns/mnt",
"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")
output, err := rebootCmd.CombinedOutput()
if ctx.Err() == context.DeadlineExceeded {
return fmt.Errorf("reboot command timed out: %w; output: %s", err, strings.TrimSpace(string(output)))
}
if err != nil {
return fmt.Errorf("reboot command failed: %w; output: %s", err, strings.TrimSpace(string(output)))
}
return nil

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants